A bias correction function for classification performance assessment in two-class imbalanced problems
Abstract
This paper introduces a framework that mitigates the impact of class imbalance on most scalar performance measures used to evaluate the behavior of classifiers. Formally, a correction function is defined with the aim of highlighting those classification results that present moderately higher prediction rates on the minority class. In addition, this function penalizes scenarios that are biased towards the majority class, as well as those that are strongly biased in favor of the minority class. This strategy assumes a typical imbalanced task, in which the minority class contains the samples most relevant to the research purposes. A novel experimental framework is designed to show the advantages of our approach when compared to the standard use of well-established measures, demonstrating its consistency and validity.
Most traditional learning methods assume that the classes of the problem share similar prior probabilities and/or misclassification costs. However, in many real-world tasks the ratios of prior probabilities between classes are significantly skewed. This situation is typically known as the imbalance problem. A two-class data set is said to be imbalanced when one of the classes (the minority one) is heavily under-represented relative to the other class (the majority one) [1]. Paradoxically, the minority class is often the most important and usually the one with the highest misclassification costs. Some typical real-life applications where this problem arises are prediction of microarray gene expression [2], prediction of corporate bankruptcy [3] and credit risk [4], fraud detection in mobile telephone communications [5] and text categorization [6]. Because examples of the minority and majority classes usually represent the presence and absence of rare cases, respectively, they are also referred to as positive and negative examples.
As pointed out by many authors [7–10], the use of plain accuracy and/or error rates to evaluate the performance of classifiers in imbalanced domains might produce misleading conclusions, since they do not take misclassification costs into account, are strongly biased to favor the majority class, and are insensitive to class skews. A plethora of alternative scalar and graphical methods have been proposed to properly assess classification performance in imbalanced scenarios. Graphical approaches depict trade-offs between two or more evaluation perspectives, allowing a richer analysis of results but making the comparison of learning algorithms a non-trivial issue. Some well-studied examples are the Receiver Operating Characteristic (ROC) curve [11,12], the Precision–Recall (P–R) curve [13], cost curves [14] and the Bayesian Receiver Operating Characteristic (B-ROC) curve …
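The bias of plain accuracy towards the majority class can be seen in a small sketch. The toy labels, the two hypothetical classifiers, and the helper names below are assumptions made only for illustration; they are not taken from the paper and do not implement its correction function.

```python
# Toy illustration: plain accuracy vs. per-class rates on imbalanced data.
# All data and names here are hypothetical, chosen only for illustration.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def rates(y_true, y_pred):
    """Return (accuracy, true positive rate, true negative rate)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return acc, tpr, tnr

# 5 positive (minority) and 95 negative (majority) examples.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority class.
pred_a = [0] * 100

# Classifier B finds 4 of the 5 positives at the cost of 15 false alarms.
pred_b = [1, 1, 1, 1, 0] + [1] * 15 + [0] * 80

acc_a, tpr_a, tnr_a = rates(y_true, pred_a)  # accuracy 0.95, TPR 0.0, TNR 1.0
acc_b, tpr_b, tnr_b = rates(y_true, pred_b)  # accuracy 0.84, TPR 0.8

print(f"A: accuracy={acc_a:.2f} TPR={tpr_a:.2f} TNR={tnr_a:.2f}")
print(f"B: accuracy={acc_b:.2f} TPR={tpr_b:.2f} TNR={tnr_b:.2f}")
```

Classifier A never detects a minority example yet scores higher on plain accuracy than classifier B, which is the kind of misleading conclusion the text refers to.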
Similar resources
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used for training is not well distributed. This inappropriate distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...
Mine Classification with Imbalanced Data
In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. Traditional classification approaches usually ignore this class imbalance, causing performance to suffer accordingly. In contrast, the infinitely imbalanced logistic regression (IILR) algorithm explicitly addresses class imbalance in its formulation. This p...
A comparative study of quantitative mapping methods for bias correction of ERA5 reanalysis precipitation data
This study evaluates the ability of different quantitative mapping (QM) methods as a bias correction technique for ERA5 reanalysis precipitation data. Climate type and geographical location can affect the performance of the bias correction method due to differences in precipitation characteristics. For this purpose, ERA5 reanalysis precipitation data for the years 1989-2019 for 10 selected syno...
On Mining Fuzzy Classification Rules for Imbalanced Data
Fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it to imbalanced data sets is its bias towards the majority class, such that it performs poorly with respect to the minority class. However, in many cases the minority classes are more important than the majority ones. In this paper, we have extended ...
A Novel One Sided Feature Selection Method for Imbalanced Text Classification
Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the larger class and might even treat the minority class data as outliers. The text data is one of t...
Journal title:
- Knowl.-Based Syst.
Volume 59, issue
Pages -
Publication date 2014